Overview

Figure: Left, a skilled sniper taking out a moving target (sniper.gif). Right, a PUBG map (pubg_map.jpg).

Motivation

Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam the island, they loot weapons and items crucial to their survival. Players can join a game solo or with a group of friends (4 players maximum). When playing solo, players are eliminated as soon as they are killed. In group play, however, killed individuals can be revived by their teammates.

We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.

Through our analysis, we aim to understand what characterizes winning players or teams: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions should be of great interest to the PUBG gaming community.

Initial Questions

The main goal of this project is to predict a player’s finish placement based on their in-game actions. Specifically, the three subquestions of interest are:

  1. Model Prediction Accuracy: How well can we predict a player’s finish placement?
  2. Feature Ranking: What player actions or statistics are most predictive of their finish placement?
  3. Clustering: Which playing styles are most successful?

Data

The data come from Kaggle’s PUBG Finish Placement Prediction competition.

  • For TA use: Run the code chunk below to download the data from the Dropbox link. This link is guaranteed to be available only during the grading period.
  • For everyone else: Join the Kaggle competition and run the shell script download_data.sh.
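library(tidyverse) # read_csv(), %>%, drop_na(), and the dplyr/ggplot2 functions used below
library(janitor)   # clean_names()

data.url <- "https://www.dropbox.com/s/319vkfevkfb6kqt/all.zip?dl=1"

# Download and unpack the data only if it is not already present
if (!file.exists("./data/train_V2.csv.zip")) {
  download.file(data.url, destfile = "./pubg.zip", mode = "wb")
  # Unzip straight into ./data so this also works when ./data already exists
  unzip("./pubg.zip", exdir = "./data")
  file.remove("./pubg.zip")
}
# Warning: Large dataset (628 MB), will take a minute or so to read.
raw_dat <- read_csv("./data/train_V2.csv.zip")

clean_dat = raw_dat %>%
  clean_names() %>%       # Convert column names to snake_case (winPlacePerc becomes win_place_perc)
  drop_na(win_place_perc) # Drop rows without outcome variable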

Variables

Each row in the data contains one player’s post-game stats. A description of all data fields is provided in data/pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 16% of the data, with 720,386 observations. The outcome variable we are trying to predict is win_place_perc.

solo_dat <- clean_dat %>% 
  #sample_n(10000) %>%
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  # Remove features that are not relevant to single players
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%
  # Following codebook explanations
  mutate(kill_points = ifelse(rank_points == -1 & kill_points == 0, NA, kill_points),
         win_points = ifelse(rank_points == -1 & win_points == 0, NA, win_points), 
         rank_points = ifelse(rank_points == -1, NA, rank_points)) %>%
  mutate(id = as.factor(id), match_id = as.factor(match_id))

Training and Test Set

We are given a training set and a test set. The outcome variable for the test set will not be released until the Kaggle competition ends on January 30, 2019. Therefore, for the purposes of this project, we use only the provided training set, which we split into our own “training” set (80%) and “test” set (20%). For the rest of the document, “training set” refers to the one we’ve created.

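library(caret) # createDataPartition()

# Split into train and test set
set.seed(1) # Make the random split reproducible
# [, 1] drops the matrix dimension so slice() receives a plain integer vector
train_ind = createDataPartition(y = solo_dat$win_place_perc, p = 0.8, list = FALSE)[, 1]
train_solo = solo_dat %>%
  slice(train_ind)
test_solo = solo_dat %>%
  slice(-train_ind)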
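
As a rough illustration of what a stratified 80/20 split does, the sketch below bins a simulated outcome into quintiles and samples 80% of the rows within each bin, which is approximately how caret's createDataPartition treats a numeric outcome. The simulated y, the helper name stratified_split, and the seed are assumptions made for this example and are not part of the analysis.

```r
# Hypothetical helper: stratified split on a numeric outcome (base R only)
stratified_split <- function(y, p = 0.8, n_groups = 5) {
  # Cut the outcome into quantile-based groups
  breaks <- quantile(y, probs = seq(0, 1, length.out = n_groups + 1))
  groups <- cut(y, breaks = unique(breaks), include.lowest = TRUE)
  # Sample a share p of the row indices within each group
  idx_by_group <- split(seq_along(y), groups)
  picked <- lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
y <- runif(1000)               # simulated stand-in for win_place_perc
train_idx <- stratified_split(y)
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx) / length(y)  # share of rows in the training set
summary(y[train_idx])          # outcome distribution in the training rows
summary(y[test_idx])           # outcome distribution in the held-out rows
```

Sampling within bins keeps the outcome distribution similar in both sets, which matters when placements cluster at the extremes.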

Exploratory Data Analysis

In the training set, we have 576,310 players and 8,071 matches.

# Compute proportions
prop_data = train_solo %>% 
  group_by(match_id, max_place, match_duration) %>% 
  count() %>%
  ungroup() %>%
  mutate(prop = n/max_place,
         remove_game = prop > 1)

# Games with proportion greater than 100%
prop_over_100 = prop_data %>%
  summarize(prop_n = sum(prop > 1),
            prop_games = prop_n/n())

# Histogram
prop_data %>%
  ggplot(aes(x = prop)) + 
  geom_histogram(bins = 30) +
  labs(title = "Proportion of players we have data for in a game",
       x = "Proportion",
       y = "Count") +
  theme_minimal()

# Remove games with proportion greater than 100%
remove_match_ids = prop_data %>%
  filter(remove_game) %>%
  pull(match_id)

train_solo = train_solo %>%
  filter(!(match_id %in% remove_match_ids))

For most games, we have data for between 70% and 90% of the players, using max_place (the worst placement for which we have data) as a proxy for the total number of players. For 14 games (0.17% of all games), we have more observations than max_place, which should not be possible. We therefore exclude these games from our analysis.
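
The coverage check can be illustrated on a toy table. This is a minimal base R sketch with made-up match IDs and counts (all values are assumptions for illustration): match "a" has 3 of 4 players observed, while match "b" has 4 rows but a max_place of 3 and so gets flagged.

```r
# Toy data: one row per observed player
toy <- data.frame(
  match_id  = c("a", "a", "a", "b", "b", "b", "b"),
  max_place = c(4, 4, 4, 3, 3, 3, 3)
)

n_obs     <- tapply(toy$max_place, toy$match_id, length)  # observed rows per match
max_place <- tapply(toy$max_place, toy$match_id, max)     # proxy for total players
prop      <- n_obs / max_place                            # share of players observed

prop

# Matches with more rows than max_place are inconsistent and get dropped
bad_matches <- names(prop)[prop > 1]
toy_clean <- toy[!(toy$match_id %in% bad_matches), ]
toy_clean
```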

Distribution of Features by Finish Percentile

We first explored the distribution of each feature by final finish percentile. Players were grouped into the 0th-19th, 20th-39th, 40th-59th, 60th-79th, or 80th-100th percentile finish, and we then plotted the density of each feature by percentile group. Note that, due to extreme outliers, we excluded the highest 1% of values for many of the features to make the visualizations clearer.
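
The percentile grouping can be sketched on a few hand-picked values. This example uses findInterval, which is one possible way to do the binning (an assumption of this sketch, chosen because dividing by the bin width and rounding down can misplace values that sit exactly on a boundary).

```r
# Hand-picked finish percentiles, including values exactly on bin boundaries
win_place_perc <- c(0, 0.19, 0.2, 0.6, 0.8, 1)
bin_labels <- c("0-19", "20-39", "40-59", "60-79", "80-100")

# Number of cut points that are <= each value: 0 for the lowest bin, 4 for the highest
bin_index <- findInterval(win_place_perc, c(0.2, 0.4, 0.6, 0.8))
win_place_cat <- factor(bin_labels[bin_index + 1], levels = bin_labels)

data.frame(win_place_perc, bin_index, win_place_cat)

# Floating-point pitfall: 0.6 / 0.2 is slightly below 3, so flooring it gives 2
floor(0.6 / 0.2)
```

A finish of exactly 1 lands in the top bin without any special handling, because it is above all four cut points.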

library(RColorBrewer) # brewer.pal()

set.seed(1)
filter_vars = c("boosts", "damage_dealt", "headshot_kills", "heals", "kills", "longest_kill", "ride_distance", "swim_distance", "walk_distance", "weapons_acquired")
train_solo %>% 
  filter_at(vars(all_of(filter_vars)), all_vars(. < quantile(., 0.99, na.rm = TRUE))) %>%   # Remove outliers
  rename_at(vars(all_of(filter_vars)), ~ paste0(., "*")) %>%    # Mark variables for which we removed outliers with asterisk
  select(-max_place, -num_groups) %>%                           # Not interested in these features
  # Bin finish percentile into five groups (0 to 4); findInterval() avoids
  # floating-point slips such as floor(0.6 / 0.2) evaluating to 2
  mutate(win_place_cat = as.factor(findInterval(win_place_perc, c(0.2, 0.4, 0.6, 0.8)))) %>%
  gather("feature", "value", -match_id, -match_duration, 
         -id, -win_place_perc, -win_place_cat) %>%
  ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
  facet_wrap(~ feature, scales = "free") +
  geom_density(na.rm = TRUE) +   # Silently drop the NAs in rank_points, kill_points, and win_points
  labs(title = "Distribution of Features by Finish Percentile", 
       caption = "* Removed outliers (> 99th percentile) from this feature's density plot",
       x = "Value of Features", y = "Density", color = "Percentile") +
  scale_color_manual(labels = c("0-19", "20-39", "40-59", "60-79", "80-100"),
                     values = brewer.pal(5, "OrRd")) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Some interesting relationships between the features and the finish percentile:

  • Use of Items (boosts, heals, and weapons_acquired): Players who finish higher tend to have used more boosts and healing items, and acquired more weapons. This is expected since they stayed in the game for a longer period and have more time to collect and use items. However, it would be interesting to explore which of these variables is most predictive of a high finish placement.

  • Kills & Damage (damage_dealt, kill_place, and kills): Players who finish higher tend to have more kills. They also tend to have dealt more damage. However, in the top finishing group, there is a wide variety in how much damage they inflict. This could potentially indicate strategies that differ in their level of aggressiveness during the course of the game but are similarly successful in achieving a high placement.

  • Distance Traveled (walk_distance, swim_distance, and ride_distance): Players who finish higher tend to have walked farther. This is likely because they simply survive longer and are forced to travel to stay in the safe zone, whereas players who die early don’t get a chance to travel very far. Both swimming and riding in vehicles are rare, though it appears that players who finish higher also tend to do more of both.

Additional observations:

  • The graphs of boosts and heals suggest that players who use more boosts or healing items are likely to last longer in the game. This makes intuitive sense, as boosts give players increased passive health regeneration and movement speed, and healing items restore health.
  • For damage_dealt, we see similar differences among players by finish percentile. Winners (i.e., the 100th percentile) have a broad distribution of damage dealt, suggesting that some solo players win without dealing much damage while others deal significantly more. It is important to note that winners must have killed at least one individual. Thus, it is expected that the damage dealt distribution for winners is shifted to the right in comparison to players with lower finish percentiles.
  • kill_place, kill_points, kills, and win_points follow bimodal distributions. This may reflect the players’ play styles. Players who land in populated areas are more likely to encounter other players, resulting in a higher probability of dying early or, if the player survives, a larger number of kills. Thus, we can partition players in the 10th percentile finish into two categories: skilled players who die early because they dropped in a populated location but still acquire a large number of kills, and less-skilled players who die early despite dropping in a less populated location.
  • Some features look highly skewed (e.g. longest_kill, ride_distance, swim_distance). We may want to log-transform these variables in our model building.
  • The num_groups density plots suggest that in games where we have little data, we tend to have data on the winners. Thus, there may be some imbalance in the data that we will need to adjust for to ensure that our model doesn’t overestimate finish percentile.
  • rank_points, win_points, and kill_points are external characteristics (from previous games) that attempt to characterize the skill level of a player. These distributions are bimodal, which may reflect the extremes of the two play styles described above. It seems that kill_points is more predictive of finish percentile than rank_points, as its right shift across finish percentile categories is more distinct. Interestingly, rank_points suggests that prior-game ranks do not have a large impact on the final placement in a game (though there is a note in the pubg_codebook.csv file that this metric is deprecated). This makes sense, since in-game variables like drop location, loot, and circle movement can affect how likely an individual is to win.
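
If the skewed features are log-transformed, the zeros need care, since many players never ride or swim. The sketch below uses log1p on simulated data; the simulated values, the seed, and the choice of log1p over a plain log are assumptions for illustration, not a decision made in the analysis.

```r
set.seed(1)
# Simulated stand-in for a skewed distance feature: mostly zeros, a few large values
ride_distance <- c(rep(0, 80), rexp(20, rate = 1 / 2000))

log(0)                                     # -Inf, so a plain log breaks on zero distances
ride_distance_log <- log1p(ride_distance)  # log(1 + x), which maps 0 to 0

summary(ride_distance)
summary(ride_distance_log)

# expm1() inverts log1p() if values are needed back on the original scale
head(expm1(ride_distance_log[81:100]))
```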

Correlation Plot

Statistics related to kills appear to be strongly correlated with finish percentile. In addition, the duration of the game (match_duration) does not seem to be strongly correlated with many of the in-game features, such as kills and walk_distance.

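library(corrplot) # corrplot()

corr_matrix = train_solo %>% 
  select(-id, -match_id) %>% 
  # rank_points, kill_points, and win_points contain NAs; use all available pairs instead of returning NA
  cor(use = "pairwise.complete.obs")

corrplot(corr_matrix, method = "color", type = "upper")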
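
Because rank_points, kill_points, and win_points were recoded to NA for some players, how cor() handles missing values matters. The toy example below (made-up values, for illustration only) compares the default behavior with use = "pairwise.complete.obs".

```r
# Toy data with missing rank_points for two players
toy <- data.frame(
  kills          = c(0, 1, 2, 3, 4),
  win_place_perc = c(0.1, 0.3, 0.5, 0.7, 0.9),
  rank_points    = c(NA, 1500, 1480, NA, 1550)
)

# Default: any pair involving a column with NAs is NA
cor_default <- cor(toy)
cor_default["kills", "rank_points"]

# Pairwise: each correlation uses the rows where both columns are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
cor_pairwise["kills", "rank_points"]
cor_pairwise["kills", "win_place_perc"]
```

With pairwise deletion, each cell of the matrix can be based on a different subset of rows, so cells involving the point variables are not directly comparable to the rest.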

Data Analysis

Models

Narrative and Summary